ABSTRACT

Parallels between sign stimuli and speech cues suggest some interesting
speculations about the origins of language. Speech cues may belong to the
class of human sign stimuli which, as in animal behavior, may be the product
of an innate releasing mechanism. Prelinguistic speech for man may have
functioned as a social-releaser system. Human language developed as a result
of the intellect, which was capable of making a semantic representation of
the world of experience, and the phonetic social-releaser system. Linguistic
capacity, the ability to learn the grammar of a language, was also necessary.
Grammar evolved to interrelate the semantic product of the intellect and the
phonetic product of the prelinguistic communication system. References are
included.
(Author/VM)

The perception of the linguistic information in speech, as investigations
carried on over the past twenty years have made clear, depends not on a
general resemblance between presently and previously heard sounds but on a
quite complex system of acoustic cues which has been called by Liberman et
al. (1967) the "speech code." These authors suggest that a special perceptual
mechanism is used to detect and decode the speech cues. I wish to draw
attention here to some interesting formal parallels between these cues and
a well-known class of animal signals, "sign stimuli," described by Lorenz,
Tinbergen, and others. These formal parallels suggest some speculations
about the original biological function of speech and the related problem
of the origin of language.

A speech cue is a specific event in the acoustic stream of speech which
is important for the perception of a phonetic distinction. A well-known
example is the second-formant transition, a cue to place of articulation.
During speech, the formants (i.e., acoustical resonances) of the vocal tract
vary in frequency from moment to moment depending on the shape and size of the
tract (Fant, 1960). When the tract is excited (either by periodic glottal
pulsing or by noise) these momentary variations can be observed in a sound
spectrogram. During the transition from a stop consonant, such as [b,d,g,p,k],
to a following vowel, the second (next to lowest in frequency) formant (F2)
moves from a frequency appropriate for the stop towards a frequency
appropriate for the vowel; the values of these frequencies depend mainly on
the position of the major constriction of the vocal tract in the formation of
each of the two sounds. Since there is no energy in most or all of the acoustic
spectrum until after the release of the stop closure, the earlier part of the
transition will be neither audible nor observable. But the slope of the later
part, following the release, is audible and can be observed (see the
transition for [b] in the spectrogram for [bɛ] in the upper portion of
Figure 1).

It is also a sufficient cue to the place of articulation of the preceding
stop: labial [b,p], alveolar [d,t], or velar [g,k]. It is as if the listener,
given the final part of the F2 transition, could extrapolate back to the
consonantal frequency or locus (Delattre et al., 1955).

It is possible electronically to synthesize speech which is intelligible,
even though it has much simpler spectral structure than natural speech
(Cooper, 1950; Mattingly, 1968). In the lower portion of Figure 1 is shown
a spectrogram of a synthetic version of the syllable [bɛ]. Synthetic speech
can be used to demonstrate the value of a cue such as the F2 transition by
generating a series of stop-vowel syllables for which the slope of the
audible part of the F2 transition is the only variable, and other cues to
position of articulation, such as the frequency of the burst of noise
following the release of the stop or the slope of F3, are absent or
neutralized (Cooper et al., 1952). A syllable in a series such as this will be
heard as beginning with a labial, an alveolar, or a velar stop depending
entirely on the slope of the F2 transition. This is true even though the slope
values appropriate for a particular stop consonant depend on the vowel: thus a
rising F2 cues [d] before [i], and a falling F2, [d] before [u] (see the
patterns in Figure 3).

Phonetic distinctions other than place are signalled by other cues.
Thus, in English, the cue separating the voiceless, aspirated stops [p,t,k]
from the voiced stops [b,d,g] is voice-onset time (Liberman et al., 1958).
If the beginning of glottal pulsing coincides with, or precedes, the release,
the stop will be heard as [b], [d], or [g], depending upon the cues to place
of articulation; if the pulsing is delayed 30 msec or more after the release,
the stop will be heard as [p], [t], or [k]. Again, the duration of the formant
transitions is a cue for the stop-semivowel distinction (e.g., [b] vs. [w])
(Liberman et al., 1956). A shorter (30-40 msec) transition will be heard as a
stop, whereas a longer (60-80 msec) transition will be heard as a semivowel.
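
As an illustration only, the two timing cues just described can be written
as simple decision rules. The sketch below is in Python. The function names
and the "unspecified" label are assumptions made for the example: the text
quotes only the values used here and states no rule for voice-onset times
between 0 and 30 msec or for transition durations outside the two ranges.

```python
# Illustrative decision rules for two timing cues, using only the values
# quoted in the text. All names are hypothetical.

STOPS = {
    ("labial", "voiced"): "b", ("labial", "voiceless"): "p",
    ("alveolar", "voiced"): "d", ("alveolar", "voiceless"): "t",
    ("velar", "voiced"): "g", ("velar", "voiceless"): "k",
}


def voicing_from_vot(vot_ms):
    """Classify voicing from voice-onset time relative to release (msec)."""
    if vot_ms <= 0:       # pulsing coincides with or precedes the release
        return "voiced"
    if vot_ms >= 30:      # pulsing delayed 30 msec or more after the release
        return "voiceless"
    return "unspecified"  # no rule is given in the text for this interval


def manner_from_transition(duration_ms):
    """Classify stop vs. semivowel from formant-transition duration (msec)."""
    if 30 <= duration_ms <= 40:
        return "stop"
    if 60 <= duration_ms <= 80:
        return "semivowel"
    return "unspecified"  # durations outside the quoted ranges


def label_stop(place, vot_ms):
    """Combine a place cue (taken as given) with the voicing cue."""
    return STOPS.get((place, voicing_from_vot(vot_ms)), "unspecified")
```

For example, `label_stop("labial", 0)` gives "b" and `label_stop("labial", 40)`
gives "p": the place cue is unchanged and only the timing of voice onset
differs.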

Some recent work indicates that human beings may possibly be born with
knowledge of these cues. While appropriate investigations have not yet been
carried out for most of the cues, the facts with respect to voice-onset time
are rather suggestive. Not all languages have this distinction between stops
with immediate voice onset and stops with voice onset delayed after release,
but for all those that do, the amount of delay required for a stop to be
heard as voiceless rather than voiced is about the same (Lisker and Abramson,
1970; Abramson and Lisker, 1970). This constraint on perception thus appears
to be a true language universal, and so likely to reflect a physiological
limitation rather than a learned convention.

Exploring the question more directly, Eimas et al. (1970), by monitoring
changes in the sucking rate of one-month-old infants listening to synthetic
speech stimuli, showed that the infants could distinguish significantly
better between two stop-vowel stimuli which straddle the critical value of
voice-onset time than between two stimuli which do not, even though the
absolute difference in voice-onset time is the same. Thus the information
required to interpret at least one speech cue appears either to be learned
with incredible speed or to be genetically transmitted.

Patterns for a Series of Stop-Vowel Syllables with Systematically Varying
F2 Transitions
Note: F2 transitions with low starting points will cause the stop
to be heard as [b], those with high starting points as [g],
and those in between as [d].

Sign stimuli, with which I propose to compare speech cues, have been
defined by Russell (1943), Tinbergen (1951), and other ethologists as simple,
conspicuous, and specific characters of a display which under given conditions
produce an "instinctive" response: the red belly of the male stickleback,
which provokes a rival to attack, or the zigzag pattern of his dance, which
arouses the female (Tinbergen, 1951); the spots by which the plover
identifies her eggs (Koehler and Zagarus, 1937); the red spot on the
gull's bill, which makes her chicks beg for food (Tinbergen, 1951). These
examples are visual, but sign stimuli are found in other modalities,
e.g., the monotone note of the white-throated sparrow's song, by which he
asserts his territorial claims (Falls, 1969); or the chemical in the blood
from a wounded minnow, which causes other minnows to flee when they scent
it in the water (Manning, 1967). Responding properly to sign stimuli is
normally of great value for the survival of the animal.
As Manning (1967:39) comments, "Sign stimuli will usually be involved where
it is important never to miss making a response to the stimulus." It is
this circumstance, perhaps, which accounts for the striking properties of
sign stimulus perception which we shall be mainly concerned with. The
animal responds not to the display in general but specifically to the sign
stimuli, and the strength of the response is in proportion to the
conspicuousness of the sign stimuli. The perception of a sign stimulus and
the response it produces have been attributed by Lorenz (1935) to a special
neural "innate releasing mechanism."

The concepts of the sign stimulus and the innate releasing mechanism,
as used in early ethological work, have come in for much criticism
(e.g., Hailman, 1969; Hinde, 1970). It has been argued that sign stimuli
cannot be shown to differ in principle from other stimuli; that some
purported sign stimuli are not actually specific to particular responses but
merely reflect the general capabilities of the animal's sense organs and
associated perceptual equipment; that the word "innate" suggests too simple
a dichotomy between nature and nurture; and that sign stimuli do not always
lead to direct and immediate responses but influence behavior in other ways.

But when all these criticisms are taken into account, there remain some
very striking phenomena. There are many cases in which a stimulus is
selectively perceived by a particular species and not by others. The
selectivity cannot be accounted for simply by an appeal to the general sensory
capabilities of the species. The stimulus consistently elicits a direct
response (or other specific behavior indicating that the stimulus has been
perceived, as in the case of orientation). This response is adaptive.
Moreover, in many instances (and in all the examples given above) the stimulus
is a character of a display by a conspecific (or symbiotically related)
individual, and the entire pattern of behavior, consisting of the display and
the response, is adaptive.

Displays of this latter sort have been called "social releasers"
(Tinbergen, 1951:171). Their component sign stimuli elicit appropriate
responses from conspecific individuals in situations important for individual
safety or for the integrity and continuity of the species. Examples
include: alarm calls; the "threat behavior" of many species, by which the
adaptive ends of sexual fighting are achieved with few actual injuries;
the displays which serve as reproductive isolating mechanisms, encouraging
intraspecific and discouraging interspecific mating; and the displays by which
parents and young identify each other, so that the latter are fed. In all
these adaptively important situations, displays composed of sign stimuli
serve to authenticate the conspecificity of individuals.

It has also been suggested before that sign stimuli actually occur in
human behavior. The facial characteristics and limb movements of babies 
evoke parental behavior (Tinbergen, 1951). Babies, in turn, respond to adult 
facial characteristics, notably to eyes and to smiles, and women have a uni- 
versal flirting gesture (Eibl-Eibesfeldt, 1970). I think that speech cues 
may also belong to the class of human sign stimuli, despite obvious differ- 
ences to be discussed shortly. But let us now consider the resemblances. 

First of all, the speech cues, like the sign stimuli, do not require 
a natural context, or even a naturalistic one; the appropriate response can 
be elicited by drastically simplified models of the natural original. Tin- 
bergen's sticklebacks would respond to an extremely crude model, provided 
only that it had a red belly, but disdained very naturalistic models which 
lacked this crucial feature (Figure 4) (Tinbergen, 1951:28). Lorenz (1954: 
291, translated by Eibl-Eibesfeldt, 1970:88) makes the general claim that 
"where an animal can be 'tricked' into responding to simple models, we have 
a response by an innate releasing mechanism." In the case of speech, most 
of the complexity of the spectrum can be dispensed with so long as the essen- 
tial cues are preserved. It has already been mentioned that the simple, two- 
formant synthetic utterances of Figure 2 are clearly heard by subjects as 
[b], [d], etc. The natural and synthetic utterances in Figure 1 are ling- 
uistically equivalent, even though in the latter only the lower formants 
appear, and these in a very stylized configuration. 

The synthetic utterance is not, however, simply an acoustic cartoon of 
the natural utterance. Though it shares with a cartoon the appearance of 
extreme simplicity and emphasis of salient features, it is rather a system- 
atic attempt to represent, consistently but exclusively, the essential 
acoustic cues, all other details of the signal being discarded or neutralized. 
The principal loss in such synthetic speech is not intelligibility but only 
naturalness. This is rather surprising. One might quite reasonably expect 
that intelligibility would depend crucially on naturalness, that tampering
with the observed spectrum of a natural utterance to any degree would alter 
its linguistic value or cause it not to be perceived linguistically at all. 

I do not mean to imply that high-quality natural speech would not be more 
intelligible than synthetic speech, or that sticklebacks would not respond 
more strongly to a real stickleback with a red belly than to a dummy. In
synthetic speech, a host of redundant minor cues, as yet unidentified, are
no doubt sacrificed together with the linguistically irrelevant details of
the signal. Similarly, in the construction of the dummy, sign stimuli of
minor importance have been ignored. But it appears that the dependence of
significant speech cues and sign stimuli on a naturalistic context is very
small. Though the listener and (for all we know) the stickleback may be 
quite aware of the lack of naturalness, neither one appears to be disturbed 
by it. The relative naturalness of the speech cues and sign stimuli them- 
selves is something else again, as will be seen shortly. 

Both speech cues and sign stimuli exhibit what Tinbergen (1951:81),
translating Seitz (1940), calls "the phenomenon of heterogeneous summation."
The same response can be elicited by separate and noninteracting sign stimuli:
thus, either the redness of the patch on the herring gull's bill or the
contrast of the patch with the rest of the bill releases the chick's pecking
response. Moreover, if two stimuli for the same response are present, but one
is defective, the second will compensate for the deficiency of the first.

Stickleback Models Used by Tinbergen
Note: The fairly realistic model marked N, which lacked a red belly, provoked
attack by male sticklebacks much less than the various crude models
labeled R, which have red bellies. (After Tinbergen, 1951.)

A similar principle operates in speech perception. Multiple cues for the 
same phonetic feature are the rule. For example, point of articulation in 
stop consonants is cued not only by the F2 transition but also by the F3 
transition and by a burst of noise at an appropriate frequency just after 
release of stop closure (Delattre et al. , 1955; Halle et al. , 1957; Harris 
et al. , 1958). In medial position, a voiced rather than a voiceless stop 
is cued by low-frequency periodic energy during closure, by lesser duration 
of closure, and by greater length of the preceding vowel (Lisker, 1957). 
Furthermore, the perceptual weight of one cue appears to be independent of 
that of the others; all combine additively to carry a single phonetic dis- 
tinction; if a cue is defective or absent, as is very often the case in 
natural speech, the deficiency is compensated for by the presence of other 
cues. Thus Hoffman (1958) compared perception of point of articulation for 
(a) synthetic stop-vowel syllables in which all three cues (burst, F2 transi- 
tion, F3 transition) were present, (b) syllables in which the burst cue was 
absent, (c) syllables in which the third formant with its transition was 
absent, and (d) syllables in which both third formant and burst were absent 
and only the F2 transition was present. He found that the optimal version 
of a cue for a particular point of articulation is the same whether presented 
separately or in combination with other cues; that labeling is most consist- 
ent when all three cues are optimal for the same point of articulation; and 
that an optimal F3 transition would compensate for a nonoptimal burst cue, 
and conversely. A.M. Liberman (personal communication) points out that speech
also carries multiple cues to the sex of the speaker: men's voices differ
from women's both in pitch range and in formant frequency range. Thus,
neither the perception of speech cues nor that of sign stimuli is a Gestalt
(Hinde, 1970).
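
The additive combination of cues described above can be made concrete with a
small sketch in Python. It is not Hoffman's procedure: the numerical support
values are invented for illustration, and the function names are
hypothetical. The only idea taken from the text is that each cue contributes
independently, that the contributions add, and that a strong cue can
compensate for a defective one.

```python
# Heterogeneous summation as a toy additive model. The support values are
# invented; only the additive, non-interacting combination follows the text.

PLACES = ("labial", "alveolar", "velar")


def summed_support(cues):
    """Add the support each cue lends to each place of articulation."""
    totals = dict.fromkeys(PLACES, 0.0)
    for support in cues.values():
        for place, value in support.items():
            totals[place] += value
    return totals


def label_place(cues):
    """Return the place of articulation with the greatest summed support."""
    totals = summed_support(cues)
    return max(PLACES, key=totals.get)


# All three cues optimal for an alveolar stop.
optimal = {
    "burst": {"alveolar": 1.0},
    "F2": {"alveolar": 1.0},
    "F3": {"alveolar": 1.0},
}

# A nonoptimal burst that, taken alone, slightly favors a labial stop.
defective_burst = {
    "burst": {"labial": 0.4, "alveolar": 0.3},
    "F2": {"alveolar": 1.0},
    "F3": {"alveolar": 1.0},
}
```

With these invented values the defective burst presented alone would be
labeled labial, but `label_place(defective_burst)` still returns "alveolar",
because the two transition cues outweigh it.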

An optimal speech cue is often not a realistic one; such a cue is the 
analog of a "supernormal" sign stimulus, such as the pattern of black spots 
on a white background on the artificial egg (see Figure 5) which the plover 
prefers to a natural egg with dark brown spots on a light brown background 
(Koehler and Zagarus, 1937). "The natural situation," Tinbergen (1951:44) 
observes, "is not always optimal." Similarly, if a human subject is presented 
with stimuli like those represented in Figure 2, he will hear the first few, 
those with rising transitions, as [bɛ]. The stimuli with the less steeply
sloping transitions are closer to what one observes in instances of [bɛ] in
natural speech, while the more extreme transitions are unlikely, perhaps even
articulatorily impossible. Yet, in a labeling test, the more steeply rising
the F2 transition, the more likely is the subject to hear [bɛ]. Thus the
subject will label more consistently not only when more cues are present but
also when the cues present are more nearly optimal, i.e., supernormal. Again,
vowels spoken in isolation will occupy more extreme positions on the F1-F2
plane than vowels in connected speech (Shearme and Holmes, 1962) and are
easier to label than the "same" vowels excised from connected speech. As
Manning (1967) says, the failure of a sign stimulus to evolve to the
supernormal extreme can usually be explained by considering other functional
requirements. Thus the low-contrast, brown-on-brown spotting of the plover's
eggs also serves to camouflage them from predators; black on a white
background would not be so effective. The vocal tract, likewise, is primarily
a group of devices for breathing and eating. A vocal tract which produced
supernormal formant transitions and extreme vowels at normal speech rates
would probably be unable to perform these primary functions properly. What
is more interesting, as Manning goes on to point out, is that the tendency
to respond to the sign stimulus has not evolved so as to be perfectly
adjusted to the naturally occurring form of the stimulus. Like heterogeneous
summation, this must reflect a characteristic of the process by which sign
stimuli are perceived, and speech perception must share this characteristic.
When we listen to natural speech, presumably we respond best to that
combination of cues which approaches the supernormal ideal most closely. Thorpe
(1961:98), similarly, has observed that the best natural sign stimulus
display is the one which "can come nearest to the supernormal for the largest
number of constituent sign stimuli."

Fig. 5. The Supernormal Plover Egg with Black Spots on a White Background
(at left) Preferred by the Plover to the Normal Egg with Dark Brown Spots on a
Light Brown Background (at right). (After Koehler and Zagarus, 1937,
reproduced in Tinbergen, 1951.)

Finally, since the validity of the concept of a specialized neural
mechanism to account for the selective perception of and response to sign
stimuli is in dispute, the possibility that some such mechanism operates in
speech perception is of special interest. The properties which speech
perception has in common with sign stimuli point in this direction, for they
are not characteristic of human auditory perception in general; so does the
possibility of genetic transmission of knowledge of the cues. There is also
some other evidence. If we ask a subject to discriminate pairs of stimuli
which are adjacent along the acoustic series of stop-vowel syllables with
varying F2 transition (Figure 2), he will do very well near the boundaries
implied by the cross-over points in his labeling functions and very poorly
elsewhere. The upper part of Figure 6 shows the labeling functions of a
typical subject; the lower part (solid line) shows his discrimination
function for the syllables. He is discriminating categorically (Liberman,
1957). Discrimination of this kind is quite unusual in psychophysical tasks.
If we now give the subject a similar discrimination task in which the stimuli
are "chirps," i.e., F2 transitions in isolation, without F1 or the
steady-state F2 (Figure 7), his discrimination function, represented by the
dashed line in the lower part of Figure 6, is quite different. He
discriminates better than random for most of the series, but the peaks of the
syllable discrimination function are absent. Without a context containing
other speech cues, the F2 transition is heard quite differently: there is no
indication of categorical perception, and the function is more typically
psychophysical (Mattingly et al., 1971).
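
Why peaks at the labeling boundaries indicate categorical perception can be
seen from a small calculation, sketched here in Python. Several things are
assumptions of the sketch and are not stated in the text: that discrimination
is tested in an ABX format (A and B differ, X repeats one of them), that the
listener covertly labels each stimulus independently and decides on the labels
alone, guessing when they give no basis for choice, and the labeling
probabilities themselves, which are invented. The function name is
hypothetical.

```python
from itertools import product


def predicted_abx(p_a, p_b):
    """Probability of a correct ABX response for a labels-only listener.

    p_a and p_b give, for stimuli A and B, the probability of each label.
    X is A or B with equal probability.
    """
    n = len(p_a)
    total = 0.0
    for x_is_a in (True, False):
        p_x = p_a if x_is_a else p_b
        for la, lb, lx in product(range(n), repeat=3):
            weight = p_a[la] * p_b[lb] * p_x[lx]
            if la != lb and lx == la:
                score = 1.0 if x_is_a else 0.0  # listener answers "A"
            elif la != lb and lx == lb:
                score = 0.0 if x_is_a else 1.0  # listener answers "B"
            else:
                score = 0.5                     # labels uninformative: guess
            total += 0.5 * weight * score
    return total


# Invented labeling probabilities ([b], [d], [g]) along a seven-step series.
labels = [
    (1.0, 0.0, 0.0),
    (0.9, 0.1, 0.0),
    (0.3, 0.7, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.8, 0.2),
    (0.0, 0.1, 0.9),
    (0.0, 0.0, 1.0),
]

# Predicted discrimination for each pair of adjacent stimuli.
predicted = [predicted_abx(a, b) for a, b in zip(labels, labels[1:])]
```

Under these assumptions the predicted values stay near chance (0.5) within a
category and rise only for the pairs that straddle a labeling boundary, which
is the pattern of the syllable discrimination function and not that of the
chirps.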

Additional evidence for a special mechanism comes from experiments in
dichotic presentation of speech sounds. If different stop-vowel syllables
are simultaneously presented to a subject's two ears, he will be able to
report correctly the stimuli presented to the right ear more often than the
stimuli presented to the left ear. The effect is attributed to the processing
of speech in the left cerebral hemisphere (Kimura, 1961; Studdert-Kennedy
and Shankweiler, 1970). No such right-ear advantage is found with nonspeech
signals such as musical tones (Kimura, 1964). Experiments by Conrad (1964),
Wickelgren (1966), and others suggest that the speech perception mechanism
is somehow involved with, and perhaps includes, "short-term memory."

To recapitulate, speech cues have a number of perceptual properties in
common with sign stimuli. Their perception does not require a naturalistic
context, they obey the law of heterogeneous summation, they are more
effective as they approach a supernormal ideal, and there is reason to suppose
that a special neural mechanism is involved. Some of these formal properties
appear in other situations (heterogeneous summation is a property of human
binocular vision, for instance), but it is their co-occurrence in both speech
and sign stimuli that I find compelling. These properties are shared by
the sign stimulus systems of many species, presumably for functional rather
than for phylogenetic reasons. Thus, we are led to ask whether speech is in
some way functionally similar to a sign stimulus system. But before
considering this point, we ought to mention certain rather obvious differences
between sign stimuli and the speech cues.

Labeling and Discrimination Functions for One Subject for the
Series of Synthetic Speech Syllables Shown in Figure 2.
The Same Subject's Discrimination Function for the Series of
"Chirps" Shown in Figure 7.

First, the speech cues are transmitted at a rate much higher than the
sign stimuli of any animal system. The displays in which sign stimuli occur,
if not virtually static, are either relatively slow-moving or highly
repetitive. But the acoustic events of speech which serve as cues occur
extremely rapidly. The speech-perceiving mechanism not only keeps up with
these events but is capable, as experiments with speeded speech have
demonstrated, of speeds more than three times greater than normal speaking
rates (Orr et al., 1965). A further gain in transmission speed is obtained by
"parallel processing": the speaker produces and the listener extracts cues for
different phonetic distinctions more or less simultaneously from the same
acoustic activity (Liberman et al., 1967). Thus in a consonant-vowel syllable,
the slope of the transition will carry information about the place of
articulation of a consonant, its manner class (stop, fricative, semivowel) and
about the quality of the vowel, while the excitation of these same transitions
will cue the voicing distinction. The information rate of speech can be as
high as 150 bits/second, and the question of the adaptive value of such a high
rate arises.

Another difference between speech cues and sign stimuli is implicit in
our use thus far of such terms as "place of articulation." Although the
speech cues are acoustic events, the phonetic distinctions perceived by the
listener are not acoustic but articulatory. Thus, the cues for, say, the
alveolar sounds [t,d] (a high-frequency burst, an F2 transition which has a
locus at about 1800 Hz, and an F3 transition with a locus at 3200 Hz) seem
like a highly arbitrary selection if they are regarded as purely acoustic
events. Moreover, the events do not occur synchronously; and, as we have
just noted, they are interspersed with cues for other phonetic distinctions.
But if these same events are interpreted as acoustic correlates of the simple
articulatory gesture which produces [t,d], both the selection of events
themselves and their relative timing appear quite straightforward. Another
indication of the articulatory reference of the cues is that a series of
stimuli may be perceived as belonging to the same phonetic category even
though they are not neighbors on an acoustic continuum, provided that they are
close together on some articulatory continuum. Thus the series of stimuli
heard as [d] before vowels ordered from high front to low back form both an
articulatory and an acoustic continuum, defined (though in somewhat
oversimplified fashion) by the [t,d] locus (see the upper portion of Figure 8).
But in the case of [k,g] the acoustic continuum is incomplete because the
concept of the locus fails to apply consistently; the locus for [k,g] with
low back vowels appears to be much lower and less clearly specifiable than
for high front vowels (lower portion of Figure 8). Yet the perception is
constant because the articulation is similar (Liberman, 1957). Conversely,
the series of stimuli in Figure 2, which do form an acoustic continuum,
divides into [b,d,g] because the articulatory reference changes abruptly at
two points on the continuum. Because of such phenomena, it seems reasonable
to regard speech as an acoustic encoding of articulatory gestures, or rather
of the motor commands underlying those gestures (Lisker et al., 1962;
Liberman et al., 1963; Studdert-Kennedy et al., 1970). We may call the sequence
of motor commands which determines the speaker's output the "phonetic
representation." The listener, because of his intuitive knowledge of the
speech code, can recover this representation.

The most notable difference between speech cues and sign stimuli is
that while sign stimuli typically produce a stereotyped behavioral response,
speech cues do not. The reason the response to speech is not stereotyped is
of course that unlike sign stimulus displays, a phonetic representation has
no fixed significance apart from the linguistic system in which it functions;
in itself it is a meaningless pattern, related only quite indirectly to the
semantic values of the speakers and hearers. Speech does not stand by itself;
it functions as part of language. The meaning of an utterance and the nature
of the ultimate behavioral response depend not just on the characters of the
stimulus, the environmental context, and the internal state of the perceiver,
but also upon something not found in conjunction with any set of sign stimuli:
a grammar. By virtue of a system of grammatical rules, shared by speaker and
hearer, the speaker can evoke not just a few stereotyped responses but a wide
variety, many of which are delayed or covert, and in principle, an infinite
range of semantic values can be expressed. The problem is to explain why and
how such a powerful system should have evolved.

It is with this problem that most attempts to find precedents for human
language in animal behavior have begun. The cries of animals grossly
resembling man, as well as animal communications systems which transmit a
substantial amount of information even though the physical nature of the
signals may be very different from human speech, have been scrutinized by many
investigators for linguistic properties. These efforts have consistently
failed. The properties treated as linguistic by some investigators have been
so abstract, for example the Hockett-Altmann "design features" (Altmann, 1967;
Hockett and Altmann, 1968), that those characteristics which distinguish
language from purposive behavior in general are lost to view (Chomsky,
1968:60) and really fundamental features are placed on a level with trivial
ones. Thus Hockett's Design Feature 3, "Rapid Fading," a property shared by
all acoustic phenomena, is apparently just as important as DF 13, "Duality of
Patterning," which, as we shall see, is truly significant. It is perhaps
noteworthy that, according to Hockett and Altmann, the stickleback's
communication system, which is of great interest from the viewpoint adopted
here, lacks most of the linguistic Design Features.

Other investigators have tried indiscriminately to force the phenomena
of animal behavior into standard linguistic categories. In Lenneberg's (1967:
228) words, they have attempted

to count the number of words in the language of gibbons, to look
for phonemes in the vocalizations of monkeys or songs of birds,
or to collect the morphemes in the communication systems of bees 
and ants. In many other instances no such explicit endeavors 
are stated, but the underlying faith appears to be the same 
since much time and effort is spent in teaching parrots, dolphins 
or chimpanzee infants to speak English.

Such efforts, I think, are doomed to failure, and those who have insisted
most strongly on the "biological basis of language," Chomsky and Lenneberg,
share this view. Chomsky (1968:62) suggests that human language "is an 
example of true 'emergence' — the appearance of a qualitatively different 
phenomenon at a specific stage of complexity of organization." Lenneberg
(1967) believes that language has for the most part evolved covertly. In his
view, we cannot expect that the steps in the evolution of a characteristic A
from some quite different characteristic B will necessarily be manifest. The
nature of the process of genetic modification is such that the intervening
steps must in many cases remain obscure. This, he suggests, is the case with
human language. While Lenneberg's general position on the nature of
evolution may well be essentially correct, to take refuge in this position in
the case of a particular evolutionary problem, such as the origin of human
language, is essentially to abandon the problem.

Despite the lack of precedents for grammar, I think that Chomsky and
Lenneberg are perhaps unduly pessimistic and that the parallels between the
speech cues and the sign stimuli suggest some interesting speculations about 
the origins of language. 

One of the traditional explanations of language is that it developed
from cries of anger, pain, and pleasure (see, e.g., Rousseau, 1755). The
difficulty with this explanation is that it does not attempt to account for
the transition from cries to names, or for the emergence of grammar. But let
us put these problems to one side for the moment and postulate, just as the
traditional explanation does, a stage in man's evolution when speech existed
independently of language. Such speech, we suppose, had no syntax or
semantics. But it was more than just expressive because it had phonetic
structure. Its utterances were phonetic representations encoded by acoustic
cues. If we ask what function such prelinguistic but structured speech could
have had, the parallels we have discussed between speech cues and sign
stimuli suggest a possible answer. Since speech is intraspecific, we suggest
that it may have been, at this stage of evolution, a social releaser. If this
speculation is correct, prelinguistic speech may have served early man as a
vehicle for threat behavior, as a reproductive isolating mechanism, and as a
means for mutual recognition of human parents and offspring. By means of the
phonetic representations underlying his utterances, man elicited appropriate
behavioral responses from his fellows in each of these crucial situations.
It is probably pointless to speculate as to what particular phonetic
representations
evoked what responses, but it perhaps reflects the primitive function which we
have attributed to speech that, while the segmental aspects of speech have
been adapted for linguistic purposes, the prosodic features remain as a
primary means of physically harmless fighting, of courting, and of
demonstrating and responding to parental affection.

Even if precedents for grammar existed in animal communication, it would be
very difficult to learn about them. Most of what we know of the grammatical
aspects of human language we know not from observations of human behavior
but by virtue of our special status as members of the human species. The
work of the linguist depends on the availability to him of the intuitions
of speakers of a language that certain utterances are, or are not, grammatical.
A member of another species, however intelligent, would find it difficult
to deduce the most elementary grammatical concepts by observing and
manipulating behavior; he would have, somehow, to consult the grammatical
intuitions of a human speaker. We are similarly at a loss when speculating
about the possible grammars of animal communication systems.

If speech was once a social releaser system, we should expect it to show
adaptation in the direction of "communications security." While being as
conspicuous as possible on appropriate occasions to conspecific individuals,
social releasers should be otherwise as inconspicuous as possible, in
particular to prey and to predators. In the case of visual releasers, various
camouflaging arrangements are found: outside the courtship period, the
stickleback changes the color of his belly to a less noticeable shade and
birds hide their brilliant plumage (Tinbergen, 1951). In the case of
acoustic releasers, the animal can become silent when this is expedient; the
simplicity of this solution is the great advantage of acoustic systems. As for
speech, two of the differences we have noted between sign stimuli and speech
cues are probably to be interpreted as further adaptations in the direction
of security. The rapid rate at which the speech cues can be transmitted
means that when necessary, transmissions can be extremely brief, making it
so much the more difficult for an enemy to locate the source of the signal.
And the fact that the articulatory information conveyed by speech can be
perceived only by man means that, from the standpoint of other animals, as
Hockett and Altmann (1968) point out, human speech is quite literally a code,
concealing not only the phonetic representation but also the fact that there
is such a representation and that the speaker is human. Presumably the
animals man preyed upon would not have been able to distinguish his speech
from the chatter of herbivorous nonhuman primates.

Moreover, if we regard speech as a social releaser system, a natural
explanation is available for an old problem. The fact that no other animal
except man can speak, not even the primates to whom he is most closely
related, has long been a cause for wonder and speculation. But, of course, a
social releaser is required, almost by definition, to be species-specific:
it must be so if it is to perform its authentication function effectively.
It is thus no more surprising that speech should be unique to man than that
zigzag dances should be unique to sticklebacks.

Let us now consider how the concept of prelinguistic speech as consisting
of a system of phonetic social releasers bears on the problem of the
origin of language. Most speculations on this topic suppose that man's
unusual intelligence must have been the principal factor in the development
of language. The weaker version of this view (which would have been that of
many post-Bloomfieldian linguists) assumes that man's intelligence differs
from that of animals in degree: he alone is intelligent enough to divide the
world into its semantic categories and to recognize their predicative
relationships. The structure of his language, insofar as it is not purely a
matter of convention, reflects the structure of human experience. The
stronger version of this view (which I think it is fair to attribute to
Chomsky and his colleagues) assumes that man's intelligence differs in kind
from that of other animals and that the structure of language, properly
understood, reflects specific properties of the human intellect. Speech,
according to either version, serves simply as the vehicle for the abstract
structure of language. The anatomy of the vocal tract imposes certain
practical constraints on linguistic behavior but has only a trivial
relationship to linguistic structure.

The difficulty with this view is not only that it makes no attempt to 
account for the choice of speech as the vehicle of language, but also that 
many animals display some degree of intelligence, and a few display intelli- 
gent behavior comparable in some ways to man’s. One would expect to find 
some limited linguistic behavior among animals of limited intelligence, or 
something approximating human linguistic behavior among animals whose intel- 
ligence seems to resemble man's. But, as we have seen, precedents of any 
kind are lacking, and it is argued that language is an instance of evolu-
tionary "emergence."

I wish to suggest a somewhat less drastic alternative to emergence.
This is that language be regarded as the result of the fortunate coexistence
in man of two independent mechanisms: an intellect, capable of making a
semantic representation of the world of experience, and the phonetic
social-releaser system, a reliable and rapid carrier of information. From
these mechanisms a method evolved for representing semantic values in
communicable form.



Before this could happen, a means had to be found for the speaker-hearer
to recode semantic representations into phonetic representations, and phonetic
representations into semantic representations. Clearly this recoding is a
complex process, if only because the intellect, being capable of representing
a wide range of human experience, probably has a very large number of
categorical features available for semantic representations in long-term
memory, while the phonetically significant configurations of the vocal tract
can be described in terms of a very small number of categorical features,
fifteen or twenty at most (Chomsky and Halle, 1968). It would thus be
impossible to accomplish the recoding simply by mapping semantic features
onto phonetic features. It was necessary for another mechanism to evolve:
linguistic capacity, the ability to learn the grammar of a language.^ The
grammar is a description of the complex but rule-governed relationships, in
part universal, in part language-specific, which obtain between semantic
representations and phonetic representations. By virtue of his grammatical
competence, a person can speak and understand utterances in the language
according to the rules of grammar.^



In this discussion, I have ignored for simplicity's sake the obvious fact
that there are not one but many languages, each with its own grammar. To
Rousseau (1755) and von Humboldt (1836), explaining the diversity of human
languages was a problem second in importance only to that of explaining the
origin of language. Recently, Nottebohm (1970) has offered the intriguing
suggestion, based on an analogy with bird song, that language diversity
enables some members of a species to develop traits appropriate to their
particular environment without an irreversible commitment to subspeciation.

The account of the organization of grammar given here, necessarily
oversimplified, is based on Chomsky (1965, 1966).

One component of the grammar is the lexicon, a list of morphemes with
which semantic, syntactic, and phonological information is associated. The
stock of morphemes in a language is large but finite, while the number of
conceivable semantic representations is infinite. But an infinite number of
grammatical strings of morphemes can be generated by the syntactic component
of the grammar, and from these, the semantic component can generate a
correspondingly infinite number of semantic representations. The phonological
component parallels the semantic component: for each string of grammatical
morphemes, a phonetic representation can be generated. The speaker's task
is thus to find a phonetic representation which corresponds grammatically to
a given semantic representation, while the hearer's task is to find a
semantic representation corresponding to a given phonetic representation. In
both his roles, the speaker-hearer, in order to recode, must determine
heuristically the probable input to a grammatical component, given its output
and the rules which generate output from input. Very little is known about
how he performs these tasks.
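
The organization just described can be made concrete with a toy sketch. It
is offered only as an illustration, not as a model of any real grammar or of
how a speaker-hearer actually works: the five-morpheme lexicon, the single
recursive rule (a clause, optionally followed by "and" and another
sentence), the predicate notation, the hyphenated "phonetic" strings, and
all function names are assumptions invented for the example. In particular,
the hearer below recovers the semantic representation by exhaustive search
over generated strings, which is merely the simplest stand-in for the
unknown heuristic procedure.

```python
from itertools import product

# Hypothetical toy lexicon: each morpheme carries syntactic, semantic,
# and phonological information (all entries invented for illustration).
LEXICON = {
    "dogs":  {"cat": "N", "sem": "DOG",  "phon": "d-o-g-z"},
    "birds": {"cat": "N", "sem": "BIRD", "phon": "b-er-d-z"},
    "bark":  {"cat": "V", "sem": "BARK", "phon": "b-ar-k"},
    "sing":  {"cat": "V", "sem": "SING", "phon": "s-i-ng"},
    "and":   {"cat": "C", "sem": "AND",  "phon": "a-n-d"},
}


def clauses():
    """Every noun-verb clause the finite lexicon allows."""
    nouns = [m for m, e in LEXICON.items() if e["cat"] == "N"]
    verbs = [m for m, e in LEXICON.items() if e["cat"] == "V"]
    return [[n, v] for n, v in product(nouns, verbs)]


def syntactic_component(n_clauses):
    """Strings with exactly n_clauses clauses: S -> Clause | Clause and S.

    The rule is recursive, so the set of strings is unbounded even
    though the lexicon is finite.
    """
    if n_clauses == 1:
        return clauses()
    shorter = syntactic_component(n_clauses - 1)
    return [c + ["and"] + rest for c in clauses() for rest in shorter]


def semantic_component(string):
    """Map a morpheme string to a predicate-style semantic representation."""
    parts = []
    for i in range(0, len(string), 3):  # clauses sit at positions 0, 3, 6, ...
        noun, verb = string[i], string[i + 1]
        parts.append(f'{LEXICON[verb]["sem"]}({LEXICON[noun]["sem"]})')
    return " AND ".join(parts)


def phonological_component(string):
    """Map the same morpheme string to a phonetic representation."""
    return " ".join(LEXICON[m]["phon"] for m in string)


def hearer(phonetic, max_clauses=3):
    """Find a semantic representation for a phonetic one by brute-force search."""
    for n in range(1, max_clauses + 1):
        for string in syntactic_component(n):
            if phonological_component(string) == phonetic:
                return semantic_component(string)
    return None


if __name__ == "__main__":
    s = ["dogs", "bark", "and", "birds", "sing"]
    print(phonological_component(s))
    print(hearer(phonological_component(s)))
```

The point of the sketch is only the shape of the arrangement: one set of
morpheme strings feeds two interpretive mappings, and recoding in either
direction requires working back from an output to the string that produced
it.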

For our purposes, however, the important point is that a grammar has an
obvious symmetry. There is a core, the syntactical and lexical components,
and two other components, the semantic and the phonological, which generate
the semantic and phonetic representations, respectively. The nature of the
semantic component, and the representation it generates, appear to be
appropriate for storage in long-term memory. The nature of the phonological
component, and the representation it generates, are appropriate for on-line
transmission by the vocal tract. To relate these two representations is the
main motivation of the grammar, and its form is determined both by the
properties of the intellect and by those of the phonetic social-releaser
system. It is thus surely not correct to view speech as if it were merely
selected by happenstance as a convenient vehicle for language.

Once the grammar had begun to develop, we should not be surprised to
find that it exercised a reciprocal influence on the development both of the
phonetic system and of the intellect. In the case of the former, it has been
argued very persuasively (Lieberman et al., in press; Lieberman and Crelin,
1971) that the vocal tract of modern man has evolved from something rather
like that of a chimpanzee to its present form, with a shorter jaw, a wider
and deeper pharynx, and vocal cords for which the tension is more finely
controlled, and that these modifications not only have no other discernible
adaptive value than to increase the reliability and the richness of
structure of human speech but are actually disadvantageous for the vocal
tract's primary functions of chewing, breathing, and swallowing. If man's
vocal tract has evolved in this way, corresponding modifications must have
taken place in the neural mechanisms for production and perception of speech,
resulting in the speech code in the form we now know it. The evidence for the
development and specialization of the human intellect as a result of its
grammatical affinities is, of course, far less concrete, but the very least
that can be said is that the capability of symbolizing things and ideas by
words permits a degree of conceptual abstraction without which the kind of
thinking which human beings regularly do would be impossible.

If the function of a grammar is to serve as an interface between the
phonetic and semantic domains, it is hardly surprising that precedents for
linguistic behavior have not been found. The speech production and perception
system is a highly specific mechanism; so also is the human intellect. Their
co-occurrence in man was a remarkable piece of luck; other animals, which on
behavioral or physiological grounds appear to be of high intelligence, had no
opportunity to develop language because they lacked a suitable pre-existing
communications system. Moreover, even if high intelligence and an appropriate
communications system had co-occurred in some other species and combined to
form a "language," its grammar would be utterly different in form from any
human grammar, because the intellectual and communicative mechanisms from
which it evolved would be quite different in detail from the corresponding
human mechanisms. In the circumstances, the most we can hope for is to
understand more about the separate evolution of the intellect and that of the
speech code and to interpret human grammars in terms of their dual origin.



To summarize, I have called attention to certain parallels between the
speech cues and sign stimuli. These parallels suggest the speculation that
prelinguistic speech may have functioned as a social-releaser system, which
would explain the fact that speech is species-specific. It is suggested,
furthermore, that human language is not simply the product of the human
intellect but is rather to be viewed as the joint product of the intellect
and of this prelinguistic communications system. Grammar evolved to
interrelate these two originally independent systems. Its dual origin explains
the lack of precedents for language in animal behavior and its apparent
"emergence."